The original data set under consideration contains 4,898 white wines with 11 variables quantifying the chemical properties of each wine. In addition, at least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
The guiding question for this analysis is:
Which chemical properties influence the quality of white wines?
Further information regarding the data set is available at this link.
To get an initial look at the wine data set, I will look at the variable names, structure, and summary.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality" "bound.sulfur.dioxide" "quality.level"
## 'data.frame': 4898 obs. of 15 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## $ bound.sulfur.dioxide: num 125 118 67 139 139 67 106 125 118 101 ...
## $ quality.level : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...
##       X         fixed.acidity    volatile.acidity   citric.acid
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800     Min.   :0.0000
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100     1st Qu.:0.2700
##  Median :2450   Median : 6.800   Median :0.2600     Median :0.3200
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782     Mean   :0.3342
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200     3rd Qu.:0.3900
##  Max.   :4898   Max.   :14.200   Max.   :1.1000     Max.   :1.6600
##
##  residual.sugar   chlorides         free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00
##  Median : 5.200   Median :0.04300   Median : 34.00
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00
##
##  total.sulfur.dioxide   density          pH              sulphates
##  Min.   :  9.0          Min.   :0.9871   Min.   :2.720   Min.   :0.2200
##  1st Qu.:108.0          1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100
##  Median :134.0          Median :0.9937   Median :3.180   Median :0.4700
##  Mean   :138.4          Mean   :0.9940   Mean   :3.188   Mean   :0.4898
##  3rd Qu.:167.0          3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500
##  Max.   :440.0          Max.   :1.0390   Max.   :3.820   Max.   :1.0800
##
##  alcohol         quality         bound.sulfur.dioxide   quality.level
##  Min.   : 8.00   Min.   :3.000   Min.   :  4.0          3:  20
##  1st Qu.: 9.50   1st Qu.:5.000   1st Qu.: 78.0          4: 163
##  Median :10.40   Median :6.000   Median :100.0          5:1457
##  Mean   :10.51   Mean   :5.878   Mean   :103.1          6:2198
##  3rd Qu.:11.40   3rd Qu.:6.000   3rd Qu.:125.0          7: 880
##  Max.   :14.20   Max.   :9.000   Max.   :331.0          8: 175
##                                                         9:   5
This is a good starting point, but also a lot of numbers to digest. In the next section I will begin to plot the data in order to visualize it.
Because quality is our primary focus in this analysis, an important question to answer is: what is its distribution amongst the wines? A bar chart of quality will help visualize this. Also, to answer the question of exactly how many wines are in each quality bin, a table follows the chart.
##
##    3    4    5    6    7    8    9
##   20  163 1457 2198  880  175    5
The vast majority of wines fall in the 5-7 range. Specifically, there are 4,535 wines in the 5-7 range, which is 92.59% of all the wines. The median of quality is 6, and the mean is 5.878. The smallest bin is the highest quality rating, 9, which contains only 5 wines (0.1%).
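The figures quoted above can be reproduced from the count table alone. The following is a minimal sketch in base R; the object names (`counts`, `quality`, `n_mid`, `pct_mid`) are my own and are not taken from the original analysis code.

```r
# Counts per quality rating, copied from the table above.
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)

# Rebuild the full vector of ratings from the counts.
quality <- rep(as.integer(names(counts)), times = counts)

n_total <- length(quality)                   # total number of wines
n_mid   <- sum(counts[c("5", "6", "7")])     # wines rated 5 to 7
pct_mid <- round(100 * n_mid / n_total, 2)   # share of wines rated 5 to 7

median(quality)           # median rating
round(mean(quality), 3)   # mean rating
```

Rebuilding the ratings with `rep()` works here only because quality is a discrete rating; the same trick would not recover a continuous variable from a binned histogram.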
Following the look at the distribution of quality above, the next plots show the distributions of the chemical properties.
##
## Fixed Acidity Median & Mean: 6.8 & 6.855
##
## Volatile Acidity Median & Mean: 0.26 & 0.278
##
## Citric Acid Median & Mean: 0.32 & 0.334
##
## pH Median & Mean: 3.18 & 3.188
##
## Free Sulfur Dioxide Median & Mean: 34 & 35.308
##
## Bound Sulfur Dioxide Median & Mean: 100 & 103.053
##
## Total Sulfur Dioxide Median & Mean: 134 & 138.361
##
## Sulfates Median & Mean: 0.47 & 0.490
##
## Chlorides Median & Mean: 0.043 & 0.046
##
## Density Median & Mean: 0.99374 & 0.994
Although none of the above chemical properties has exactly the same median and mean, their distributions appear roughly normal.
##
## Residual Sugar Median & Mean: 5.2 & 6.391
There appears to be a large number of wines in the lowest residual sugar bin. The distribution also appears to show a positive skew.
##
## Alcohol Median & Mean: 10.4 & 10.514
The alcohol distribution appears relatively flat, except that it contains one unusually large bin.
For all of the chemical properties, the median is smaller than the mean, which is consistent with at least a slight positive skew in each distribution.
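The comparison of medians and means can be checked in a few lines. This is a rough sketch that uses only the rounded values reported above; the column name `rel_gap` is my own, and the relative gap between mean and median is only an informal hint of skew, not a formal skewness statistic.

```r
# Medians and means as reported in the output above (rounded values).
stats <- data.frame(
  property = c("fixed.acidity", "volatile.acidity", "citric.acid", "pH",
               "free.sulfur.dioxide", "bound.sulfur.dioxide",
               "total.sulfur.dioxide", "sulphates", "chlorides",
               "density", "residual.sugar", "alcohol"),
  median = c(6.8, 0.26, 0.32, 3.18, 34, 100, 134, 0.47, 0.043,
             0.99374, 5.2, 10.4),
  mean   = c(6.855, 0.278, 0.334, 3.188, 35.308, 103.053, 138.361,
             0.490, 0.046, 0.994, 6.391, 10.514)
)

# Is the mean above the median for every property?
all(stats$mean > stats$median)

# Relative gap between mean and median: a scale-free, informal skew hint.
stats$rel_gap <- (stats$mean - stats$median) / stats$median

# Largest gaps first.
stats[order(-stats$rel_gap), c("property", "rel_gap")]
```

On these values residual sugar has by far the largest relative gap, which agrees with the visibly skewed histogram.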
This data set contains 4,898 white wines with 12 variables quantifying the chemical properties of each wine, and 2 others reporting the subjective quality of each wine.
The main feature is wine quality.
All of the other features, the chemical properties of the wines, may help support the investigation into wine quality, which is the question under investigation.
Although I had no prior knowledge that it would have any effect, I created a new variable, bound sulfur dioxide (total sulfur dioxide minus free sulfur dioxide). It seemed an obvious variable to create, with free and total sulfur dioxide already present in the data.
I also created another variable, “quality level”, which is simply quality as a factor. This was done to make certain graphs easier to create going forward.
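The code that created these two variables is not shown, so the following is a reconstruction rather than the original. It assumes bound sulfur dioxide is total minus free, which matches the first rows of the structure output, and it uses only those first ten rows as a stand-in for the full data frame (called `wines` here, a name of my choosing).

```r
# First ten rows of the relevant columns, copied from the structure output above.
wines <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129),
  quality              = c(6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L)
)

# Assumption: bound sulfur dioxide is the part of the total that is not free.
wines$bound.sulfur.dioxide <-
  wines$total.sulfur.dioxide - wines$free.sulfur.dioxide

# Quality as a factor; levels 3 to 9 are the ratings that occur in the full data.
wines$quality.level <- factor(wines$quality, levels = 3:9)

str(wines)
```

Setting the levels explicitly keeps all seven ratings available even when a subset, like these ten rows, contains only one of them.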
Residual sugar appears to have many values in the lowest bin.
According to Wikipedia:
“Even among the driest wines, it is rare to find wines with a level of less than 1 g/L, due to the unfermentability of certain types of sugars, such as pentose.”
The number of wines with a measured residual sugar of less than one was determined:
## [1] 15
Histogram of wines with residual sugar >= one:
This histogram is similar to the original, so it was determined not to remove any data from the set.
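The count and the filtered histogram can be sketched as follows. The vector below is only a stand-in: it holds the first ten residual sugar values from the structure output plus the reported minimum of 0.6, so the count it produces is 1 rather than the 15 found in the full data. The bin width of 1 is also my assumption.

```r
# Stand-in values: first ten from the structure output above,
# plus the reported minimum (0.6) so that one value falls below 1.
residual.sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5, 0.6)

# Number of wines below 1 (15 in the full data, 1 in this stand-in).
n_below_one <- sum(residual.sugar < 1)

# Keep only wines at or above 1 for the comparison histogram.
sugar_ge_one <- residual.sugar[residual.sugar >= 1]

# Compute the bins without drawing, so the two shapes can be compared.
bins  <- seq(0, 21, by = 1)
h_all <- hist(residual.sugar, breaks = bins, plot = FALSE)
h_ge  <- hist(sugar_ge_one,  breaks = bins, plot = FALSE)

# Only the first bin differs between the two histograms.
which(h_all$counts != h_ge$counts)
```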
Residual sugar skews positive, while most other histograms resembled a somewhat normal distribution. To gain further perspective, a log transformation was applied to residual sugar:
The log scale for residual sugar looks somewhat bimodal.
The alcohol histogram looked somewhat flat, so log and square root transformations were done.
Log transformation:
Square root transformation:
After the transformations, the alcohol histogram is still somewhat flat, although it gives the impression of a positive skew.
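The two transformations can be sketched in base R as shown here. The base of the logarithm is not stated in the text, so base 10 is an assumption, and the ten alcohol values are just the first rows from the structure output, not the full data.

```r
# First ten alcohol values, copied from the structure output above.
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)

log_alcohol  <- log10(alcohol)   # log transformation (base 10 assumed)
sqrt_alcohol <- sqrt(alcohol)    # square root transformation

# Both transformations are monotonic: they keep the ordering of the wines
# and only stretch or compress the axis.
identical(order(alcohol), order(log_alcohol))
identical(order(alcohol), order(sqrt_alcohol))

# Bin the transformed values without drawing.
hist(log_alcohol, plot = FALSE)$counts
hist(sqrt_alcohol, plot = FALSE)$counts
```

Alcohol only ranges from 8 to 14.2 in this data set, and over such a narrow range both functions are close to linear. That is one plausible reason the histogram keeps much the same shape after either transformation.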
Now that we have had a look at the variables in the data set individually, the next step will be to begin to look at how the variables relate to each other.
For an initial quick overview of the relationships between the variables in the data set, a correlation matrix is a good starting point.
## The variables shown in the correlation matrix below are in
## the following order:
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## [13] "bound.sulfur.dioxide"
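A correlation matrix of this kind can be computed with `cor()`. The sketch below uses only the first ten rows of a few columns from the structure output, so its numbers will not match the full-data matrix. The choice of Pearson correlation is an assumption, since the original call is not shown. Quality is left out because all ten of these rows have the same rating, which makes its correlation undefined.

```r
# First ten rows of a few columns, copied from the structure output above.
wines <- data.frame(
  fixed.acidity        = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)
wines$bound.sulfur.dioxide <-
  wines$total.sulfur.dioxide - wines$free.sulfur.dioxide

# Pearson correlation between every pair of columns.
cor_matrix <- cor(wines, method = "pearson")
round(cor_matrix, 2)
```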
To get more detail, and begin to answer the question of how the chemical properties of the wines relate to quality, the next section will look at scatter plots comparing the chemical properties with quality and the correlation between each chemical property and quality.
## Correlation of Quality and Fixed Acidity: -0.1136628
## Correlation of Quality and Volatile Acidity: -0.194723
## Correlation of Quality and Citric Acid: -0.009209091
There appears to be little to no relationship between citric acid and quality.
## Correlation of Quality and Residual Sugar: -0.09757683
## Correlation of Quality and Chlorides: -0.2099344
## Correlation of Quality and Free Sulfur Dioxide: 0.008158067
There appears to be little to no relationship between free sulfur dioxide and quality.
## Correlation of Quality and Total Sulfur Dioxide: -0.1747372
## Correlation of Quality and Bound Sulfur Dioxide: -0.2178678
## Correlation of Quality and Density: -0.3071233
The relationship of density and quality shows the largest negative correlation. It can also be seen above that chlorides, bound sulfur dioxide, and total sulfur dioxide have negative correlations with quality, though weaker ones (roughly -0.17 to -0.22).
## Correlation of Quality and pH: 0.09942725
## Correlation of Quality and Sulphates: 0.05367788
## Correlation of Quality and Alcohol: 0.4355747
Alcohol’s correlation with quality has the highest magnitude of all the chemical properties.
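To compare the correlations reported above side by side, they can be ranked by absolute value. This sketch simply re-enters the printed coefficients; the object names (`cor_with_quality`, `ranked`) are my own.

```r
# Correlations with quality, copied from the output above.
cor_with_quality <- c(
  fixed.acidity        = -0.1136628,
  volatile.acidity     = -0.194723,
  citric.acid          = -0.009209091,
  residual.sugar       = -0.09757683,
  chlorides            = -0.2099344,
  free.sulfur.dioxide  =  0.008158067,
  total.sulfur.dioxide = -0.1747372,
  bound.sulfur.dioxide = -0.2178678,
  density              = -0.3071233,
  pH                   =  0.09942725,
  sulphates            =  0.05367788,
  alcohol              =  0.4355747
)

# Order by absolute value so strong negatives and positives rank together.
ranked <- cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]
round(ranked, 3)

# Squared correlation: the share of variance in quality that a single
# property accounts for in a simple linear fit.
round(ranked^2, 3)
```

Even for alcohol, the squared correlation is only about 0.19, so no single property on its own accounts for most of the variation in quality.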
To get a better look at the distribution of the chemical properties in each quality bin, box plots are shown next. These box plots will be a nice visual of some of the numerical data presented above in the “Data Summary” section.
In the previous section of scatter plots, the citric acid/quality correlation line appeared flat. In looking at the box plot there is a suggestion that the highest (9 rated) quality wines may have a bit more citric acid. It must be kept in mind that there are only 5 wines in that bin.
The quality/alcohol box plot appears to suggest that the positive correlation between quality and alcohol shown in the scatter plot section may not hold for the lowest (3 & 4 rated) quality wines.
There were 4 different measures of acidity in the data: fixed acidity, volatile acidity, citric acid, and pH. In all cases, quality was inversely correlated with acidity (lower pH readings mean higher acidity, so the positive correlation with pH points the same way), although the correlation with citric acid is close to zero. The strongest negative correlations with quality were density, bound sulfur dioxide, and chlorides. The strongest positive correlation with wine quality was alcohol.
The two strongest correlations amongst the chemical properties of the wines were between bound sulfur dioxide and total sulfur dioxide (0.9224823), and between residual sugar and density (0.8389665).
The strongest relationship with quality was a positive one with alcohol.
In the next section, I will incorporate another variable into the plots, specifically to look at the question of how combinations of chemical properties relate to quality.
The next 6 plots are scatter plots for pairs of 2 different chemical properties versus each other. To add the third dimension to the analysis, the color of the data points reflects the quality: lower quality wines have lighter colors, higher quality wines have darker colors.
It appears the lightest (lowest quality) area of the plot is where alcohol is low, and bound sulfur dioxide is high.
This plot shows the strong negative relationship between alcohol and density, with the low density/high alcohol wines in general having greater quality than high density/low alcohol wines.
It is interesting in the plot of free sulfur dioxide and citric acid to see that in the area where free sulfur dioxide is greater than 120, 8 of the 20 wines with a 3 rating appear, and all wines in this region are of lower quality.
Although the above 3 plots do not seem to give us any new information, they do reinforce the previous correlations found in the bivariate section.
The final multivariate plot shows two of the chemical properties faceted by quality. It shows an interesting relationship discussed below.
For this section I chose a number of pairs of chemical properties, and looked at how their relationship to each other changed with respect to the main feature of quality. Some of the observations showed no obvious changes with respect to quality (e.g. citric acid vs. free sulfur dioxide). Other observations reinforced some of the previous bivariate observations, for example the relationship between chlorides and quality. Looking at the two plots of alcohol vs. chlorides and citric acid vs. chlorides above, one can see that the highest quality wines have low chlorides compared with the lower quality wines. Also, in the plots that include alcohol, it is obvious that higher quality wines tend to have higher alcohol.
I thought the most interesting interaction was seen in the plot of sulphates and bound sulfur dioxide faceted by quality. In the lower quality wines the relationship seems to slope upwards, suggesting a positive relationship. As the wine quality increases the slope appears to flatten, and at the highest qualities it looks to be flat, suggesting no relationship between sulphates and bound sulfur dioxide.
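One way to put numbers on a slope that changes across facets is to fit a separate straight line within each group and compare the slopes. The sketch below shows the technique on synthetic, made-up values only; it is not the wine data and does not reproduce the plot. In the wine setting, `group` would stand for quality level and `x` and `y` for sulphates and bound sulfur dioxide, in whichever orientation the plot used.

```r
# Synthetic, made-up values for illustration only (NOT the wine data):
# in group "low" y rises with x, in group "high" y stays roughly flat.
toy <- data.frame(
  group = rep(c("low", "high"), each = 5),
  x     = rep(c(1, 2, 3, 4, 5), times = 2),
  y     = c(1.1, 1.9, 3.2, 3.8, 5.0,    # rising
            3.0, 3.1, 2.9, 3.0, 3.0)    # flat
)

# Fit y ~ x separately within each group and keep only the slope.
slope_by_group <- sapply(split(toy, toy$group), function(d) {
  unname(coef(lm(y ~ x, data = d))["x"])
})
round(slope_by_group, 2)
```

With only 5 wines rated 9 and 20 rated 3, slopes fitted in the extreme quality groups of the real data would be very uncertain, so a comparison like this is suggestive at best.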
The guiding question for this analysis is: Which chemical properties influence the quality of white wines? The variable that correlated most strongly with quality was alcohol. This can be seen in the above scatter plots. The linear regression line is clearly positively sloped. As shown earlier, the correlation coefficient is 0.4355747. The smoothed fit curve adds more information. It can be seen that the positive relationship does not appear to occur below a quality rating of 5. Above 5, the relationship is clear.
Another chemical property that showed a strong relationship with quality was density, in this case a negative one. This can be seen in the above histogram. The higher quality wines are seen in much larger proportion amongst the lower densities in the plot. One interesting observation in the histogram is that the lowest quality wines are in general distributed relatively evenly around the center.
This plot was chosen as an interesting observation and extension of plot 2. As seen in plot 2 (and the previous bivariate analysis), in general higher quality wines have lower densities. This plot (plot 3) shows that too. Plot 3 also incorporates residual sugar, and it shows an interesting trend: those low density/high quality wines also appear to be the wines with higher residual sugar.
With 4,898 observations, this is a relatively large data set. With 11 variables quantifying the chemical properties of each wine, there seemed to be plenty of measurements to analyze. As opposed to the chemical measurements of the wines in the data set, wine quality was an entirely subjective measure. One could possibly criticize the data for that reason. Personally, I feel that there is no other way to measure quality than subjectively. The quality ratings were said to be the “median of at least 3 evaluations made by wine experts”. Perhaps future data can be collected with more than 3 evaluations per wine.
The guiding question of this analysis was which chemical properties of white wines have an effect on wine quality. Both the correlation coefficients and the box and scatter plots showed that alcohol had the strongest relationship with quality. Other important properties had a negative relationship with quality, specifically density, chlorides, and bound sulfur dioxide.
I felt that it was important throughout the analysis to keep in mind the focus on the guiding question. For that reason, in the bivariate section all the chemical properties were plotted against quality, and in the multivariate sections I chose to analyze pairs of chemical properties with respect to quality. In hindsight, I feel those were good choices, and the analysis was better due to that focus.
I did have some difficulty in the multivariate section of the analysis. I tried to analyze many different pairs of chemical properties with respect to quality (some not shown in final analysis), but seemed to not find many interesting relationships. I also felt it important to have a multivariate plot in the final plots, still keeping in mind that it needed to focus on quality. It took some time to find a plot that was of interest and furthered the analysis. In the end, some interesting multivariate trends were found.
Were I to conduct any future analysis, it would be interesting to loosen the focus on quality, and learn more of the relationships amongst the many chemical properties.
This analysis was also an opportunity to learn the tools of the R programming language, specifically the ggplot2 package, as a means to analyze a large set of data. Those tools proved to be a good resource for a data analyst to conduct an analysis. Of possibly greater importance, they also appear to be an effective way to communicate those findings to others.